PARALLEL PROCESSOR FOR MOTION ESTIMATOR

This invention relates to video encoding and
decoding, and in particular to the calculation of motion
vectors in a video compression system such as MPEG-2.

The MPEG-2 video standard is defined in ISO/IEC
13818-2 and is based on elimination of redundant video
data to enable high quality picture information to be
transmitted over a relatively narrow bandwidth channel.
Video compression is achieved in a number of separate ways
including intra-frame coding and inter-frame coding.
Intra-frame coding reduces video data first by quantising
discrete cosine transform (DCT) coefficients of spatial
data. The image to be coded is divided into a number of
macroblocks each of 16 x 16 pixels and a different
quantising scale may be defined for each macroblock.
Following quantisation, lossless data reduction is applied
by using Variable Length Coding (VLC) and Run Length
Coding (RLC) to reduce the number of bits required to
encode common patterns and frequently occurring values.
The image to be encoded is also divided into a number of
blocks each of 8 x 8 pixels. Variable Length Coding
and Run Length Coding are performed on the 8 x 8 pixel blocks
using a zigzag pattern to maximise redundancy.

Inter-frame compression seeks to eliminate
information which is redundant by virtue of it having been
present in a past or future image defined as an anchor
frame. The anchor frame is a full resolution, full data
picture. As the image will often contain portions which
are moving from frame to frame, motion vectors are used to
predict a present frame from an anchor frame. Motion
vectors are assigned at a macroblock level and the
predicted frame is subtracted from the actual frame to
form a difference frame which has a much lower information
content than the actual frame. The content of the
difference frame will depend on the accuracy of the
predicted frame. The predicted frame is developed from an
IDCT quantised, decoded picture.

Inter-frame prediction may be based solely on forward
prediction from intra-frame coded images or other forward
predicted frames, or be bi-directionally predicted from
both a previous and a future intra-frame coded or forward
predicted frame. Bidirectional coding necessarily means
that the video input order must be changed so that the
past and the forward anchor frames are known.

The MPEG-2 standard provides a number of defined 
system configurations which are represented as levels and 
profiles as shown in table 1 below. 



LEVEL/PROFILE   SIMPLE          MAIN            SNR             SPATIAL         HIGH

HIGH            -               1920x1152       -               -               1920x1152
                                80 Mb/s                                         100 Mb/s

HIGH 1440       -               1440x1152       -               1440x1152       1440x1152
                                60 Mb/s                         60 Mb/s         80 Mb/s

MAIN            720x576         720x576         720x576         -               720x576
                15 Mb/s         15 Mb/s         15 Mb/s                         20 Mb/s

LOW             -               352x288         352x288         -               -
                                4 Mb/s          4 Mb/s







The MPEG-2 standard is designed to be scalable, that
is, decoders and encoders do not need to be of comparable
quality to work together. It is desirable to design
motion estimation processors which use corresponding VLSI
technologies for the corresponding MPEG profiles. Where
possible it is desirable that the processors should be on
a single chip. However, where this is not yet possible,
for the highest profiles and levels, it is desirable to be
able to operate a plurality of motion estimation
processors in parallel.



In addition, to ensure that the maximum degree of
video data compression can be achieved within the
confines of the MPEG-2 standard, it is desirable to be
able to search the whole of the frame with half-pixel
accuracy.

Computationally, the calculation of motion vectors
is the most demanding operation in coding video to the MPEG
standard. The process is illustrated in figure 1 in
which the forward anchor frame is identified by the
reference numeral 10, the backward anchor frame by the
numeral 14 and the current frame by 12. In figure 1 it
can be seen that a given macroblock 16 is in a different
position in each of the three frames, indicating a
non-constant velocity movement.

For each of the macroblocks in the current frame 12
it is necessary to search for the matching macroblock in
the full anchor frame with half-pixel precision. The
expression for the full search algorithm for a single
current frame macroblock is:

$$(Z-X,\; G-Y) = \arg\min_{X,Y}\left[\sum_{i=1}^{N}\sum_{j=1}^{M}\left|B(Z+i,\,G+j) - I(X+i,\,Y+j)\right|\right] \qquad (1)$$

where X,Y are the coordinates of the upper left
corner of the anchor frame macroblock;

Z,G are the coordinates of the upper left corner of
the current frame macroblock;

(Z-X, G-Y) are the motion vector coordinates for the
current macroblock being examined; and M,N are the
macroblock dimensions in pixels.
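
By way of illustration only, the full search of equation (1) might be sketched in Python as follows. The sketch assumes that B denotes the current frame luminance and I the anchor frame luminance (the text does not define them explicitly), that frames are lists of rows indexed frame[row][column] from zero, and that n is the block width and m the block height. The names `sad` and `full_search` are hypothetical.

```python
def sad(cur, anc, z, g, x, y, n, m):
    """Sum of absolute differences between the n x m block of `cur` whose
    upper left corner is (z, g) and the block of `anc` at (x, y)."""
    return sum(abs(cur[g + j][z + i] - anc[y + j][x + i])
               for j in range(m) for i in range(n))


def full_search(cur, anc, z, g, n, m):
    """Evaluate every anchor position at which an n x m block fits and
    return ((z - x, g - y), sad) for the position with the smallest SAD."""
    height, width = len(anc), len(anc[0])
    best = None
    for y in range(height - m + 1):
        for x in range(width - n + 1):
            s = sad(cur, anc, z, g, x, y, n, m)
            if best is None or s < best[1]:
                best = ((z - x, g - y), s)
    return best


# A 2 x 2 pattern sits at (1, 1) in the anchor frame and at (2, 0) in the
# current frame, so the motion vector (Z-X, G-Y) is (1, -1).
anchor = [[0, 0, 0, 0], [0, 9, 8, 0], [0, 7, 6, 0], [0, 0, 0, 0]]
current = [[0, 0, 9, 8], [0, 0, 7, 6], [0, 0, 0, 0], [0, 0, 0, 0]]
vector, error = full_search(current, anchor, 2, 0, 2, 2)
```

Every candidate position is visited, which is what distinguishes the full search from logarithmic or telescopic searches and is the source of its computational cost.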
Referring now to figure 2, a half-pixel precision
search can be understood as being a linear interpolation
of adjacent pixels. Thus, in figure 2, A, B, D and E represent
pixels of the original luminance matrix and h, v, c and the
two unidentified points represent half-pixels.

The half pixels are calculated by the following
linear interpolations:

Horizontal interpolation:  h = (A + B) / 2           (2)

Vertical interpolation:    v = (A + D) / 2           (3)

Central interpolation:     c = (A + B + D + E) / 4   (4)
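
As a minimal sketch of equations (2) to (4), the three half-pixel values for one group of four adjacent pixels could be computed as below. The function name `half_pixel_samples` is hypothetical, and plain division is used without the integer rounding that a practical encoder would apply; the layout (B to the right of A, D below A, E diagonally opposite A) is inferred from the equations.

```python
def half_pixel_samples(a, b, d, e):
    """Return (h, v, c) for the four adjacent luminance pixels A, B, D, E."""
    h = (a + b) / 2           # equation (2): horizontal interpolation
    v = (a + d) / 2           # equation (3): vertical interpolation
    c = (a + b + d + e) / 4   # equation (4): central interpolation
    return h, v, c


h, v, c = half_pixel_samples(10, 20, 30, 40)
```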

As motion estimation requires the vectors of a number 
of macroblocks to be determined, and as video information 
is both spatial and temporal, parallel computing 
techniques are ideal for motion estimation. 

A number of architectures are known in the art
which are aimed at increasing computational performance
whilst performing a full search algorithm (within the
chosen search range all possible displacements are
evaluated using the block matching criterion, in contrast
to logarithmic, telescopic and other searches).

In papers entitled "Array Architectures for Block
Matching Algorithms" by T. Komarek, P. Pirsch, IEEE Trans.
Circuits and Systems, Vol 36, No. 10, Oct. 1989,
pp. 1301-1308, and "Parameterizable VLSI architectures for the
Full-Search and Block Matching Algorithm" by L. De Vos, M.
Stegherr, IEEE Trans. Circuits and Systems, Vol 36, No. 10,
Oct. 1989, pp. 1309-1316, there is described a
two-dimensional systolic matrix which achieves high
computational performance by a maximum degree of parallelism
in the performance of operations on a single anchor frame
macroblock MxN. However, the architecture disclosed has
the disadvantage that it only works with a given
macroblock size and is not suitable for processing
with half-pixel precision. In addition, the burst
pipeline latency is such that a decrease of up to 50% in
computational performance is possible. Moreover, the
architecture described has a high data bandwidth
requirement as it has a large number of external ports for
data input and output.

Various architectures have been proposed which are
free from the disadvantages of the two-dimensional
systolic matrix. A one-dimensional systolic matrix is
disclosed in US 4,897,720 (Wu et al) and in a paper
entitled "A family of VLSI designs for the Motion
Compensation Block-Matching Algorithm" by Yang, Sun and
Wu, IEEE Trans. Circuits and Systems, Vol 36, No. 10, Oct.
1989, pp. 1317-1325.

This architecture is based on performing pipelined
computations for a single row of pixels in a macroblock.
This reduces pipeline latency and, potentially, can
calculate motion vectors to half-pixel precision by using
four devices operating in parallel. However, the
architecture has the disadvantage of a lower computational
performance compared to the two-dimensional systolic
matrix.

US 5,636,293 (Lin et al) discloses an architecture
designed to increase the computational performance of the
one-dimensional systolic matrix. A modular architecture
is used which connects one-dimensional systolic matrices
in tandem, allowing acceleration of calculations in the
search window without increasing the number of data
points. However, this architecture has the disadvantage
that it does not provide half-pixel precision, and
computational performance is reduced as motion vectors for
only a single macroblock can be searched for in the search
window.



US 5,719,642 (Lee) discloses a systolic matrix with
global links for anchor frame data input into the
row of processing elements of a single macroblock row
processing architecture. In addition, increases in anchor
frame data memory can achieve 100% exploitation of
hardware. However, the computational performance is limited
by the number of MxN processing elements which operate in
parallel. In addition, the architecture of US 5,719,642
cannot calculate motion vectors with half-pixel precision.

US 5,568,203 (Lee) discloses an architecture in which 
the motion estimator inputs data serially into a matrix of 
shift registers and simultaneously loads in parallel the 
anchor frame pixel data into the MxN matrix of processing 
elements. The matrix of processing elements provides 
serial calculations of the full search algorithm (equation 
1). Whilst this architecture has the advantage of 
minimising the number of input and output ports, and fully 
utilizes hardware resources, it cannot calculate motion 
vectors with half-pixel precision. In addition, 
computational performance is impaired as only the MxN 
processing elements operate in parallel. 

US 5,453,799 (Yang et al) discloses a unified motion 
estimator which performs MPEG-2 motion vector calculations 
on VLSI chips operating in parallel. However, 
computational performance is restricted to processing a 
single macroblock of the current frame in the search 
window. 

US 5,030,953 (Chiang) discloses a matrix of signal
processors, consisting of M parallel groups of
sub-matrixes with N parallel operating processing elements,
which calculate the sum of absolute values of differences
for a single row of macroblocks being compared. The
architecture effectively utilizes hardware resources and
minimises the number of I/O ports but has restricted
computational performance as it searches the motion vector
of a single macroblock of the current frame and cannot
calculate motion vectors with half-pixel precision.

The invention aims to overcome or ameliorate the
disadvantages of the systems described above. In its
broadest form, the invention provides for the simultaneous
comparison of S current frame macroblocks with nK
macroblocks of the anchor frame. Preferably, K is the
number of macroblocks in the area of the anchor frame whose
upper left corner coordinates are defined with
single pixel precision, and 4K is the number of macroblocks in
the area of the anchor frame whose upper left corner
coordinates are defined with half-pixel precision.

More specifically, there is provided a parallel
processor for estimating motion of a given portion of a
current image frame with reference to an anchor frame
comprising: an input for receiving current frame data; an
input for receiving anchor frame data; a two-dimensional
matrix of processing elements each for comparing a given
area of the current frame with at least an area of the
anchor frame wherein the matrix simultaneously compares S
areas of the current frame with nK areas of the anchor
frame, the matrix having dimensions of KxS and n being an
integer; means for selecting from the comparison, for each
area of the current frame, an area of the anchor frame
corresponding to the area of the current frame; and means
for outputting data identifying the selected areas of the
anchor frame.

Embodiments of the invention have the advantage of
increasing computational performance by adding additional
unitary modules without requiring any modification of the
initial architecture or control signals; the system
is thus truly modular. Furthermore, embodiments of the
invention have the advantage that VLSI technology may be
used to make individual devices which can calculate motion
vectors for the various MPEG-2 levels and profiles and for
video with any parameters.

A preferred embodiment of the invention may have the 
advantage that half-pixel precision is achieved using the 
full anchor frame search by comparing pairs of current 
frame and anchor frame macroblocks. 

An embodiment of the invention will now be described, 
by way of example only, and with reference to the 
accompanying drawings, in which: 

Figure 1, previously described, shows the movement of 
a macroblock between a past, present and future frame; 

Figure 2, previously described, illustrates half 
pixel points within a given block of four adjacent pixels; 

Figure 3 is a block schematic diagram of the
architecture of a motion vector processor embodying the
invention;

Figure 4 shows one of the processing elements of 
figure 3 in greater detail; 

Figure 5 is an alternative realisation of the 
processing element of figure 4 for single pixel precision; 

Figure 6 shows, in more detail, one of the parallel 
pipelined modules P of figure 5; 

Figure 7 shows, in more detail, one of the input 
modules of figure 3; 



Figure 8 shows, in more detail, the memory unit of
figure 7;

Figure 9 is a block diagram of the Bi module of
figure 3;

Figure 10 is a block diagram of the input B module of
figure 3;

Figure 11 is a flow chart showing the steps in the
anchor frame data priming process for generation of
macroblock coordinates;

Figure 12 shows, in more detail, the READ F step in
figure 11;

Figure 13 shows, in more detail, the WRITE T step in
figure 11;

Figure 14 shows, in more detail, the WRITE F step in
figure 11;

Figure 15 is a representation of an anchor frame
divided into stripes for processing;

Figure 16 shows an MPEG processor including a motion 
vector processor embodying the invention; 

Figure 17 shows, in block schematic form, the
architecture of a multipoint videoconferencing system or a
DVD system including the motion vector processor of figure
16;

Figure 18 shows, in block schematic form, the
architecture of a videophone system including the motion
vector processor of figure 16;

Figure 19 shows, in block schematic form, the
architecture of a digital video camera including the motion
vector processor of figure 16; and

Figure 20 shows, in block schematic form, the
architecture of a television or video encoder including the
motion vector processor of figure 16.

The architecture of figure 3 is based on the
simultaneous comparison of S current frame macroblocks
with K macroblocks of the anchor frame. This may be a
portion of the anchor frame or the whole anchor frame
depending on the picture size. The macroblocks are
preferably 16 x 16 pixel luminance blocks although the
MPEG-2 standard also supports 16 x 8 luminance pixel
blocks or even 8 x 8 chrominance blocks.

It will be appreciated that this approach differs
from the prior art in which a single current frame
macroblock is compared with the anchor frame macroblocks
in the search window. The architecture of figure 3 can be
realised on a single VLSI chip but, where K and S are such
that a single chip is insufficient, individual modules can
be connected together without requiring any
reorganisation.

In figure 3, a plurality of K input modules 20 each
receives anchor frame data Ih, Iv on respective inputs
22, 24. The output from the input modules 20(1) to 20(k)
is identified as PI(1) to PI(k) and represents a
transformed version of the input data. The outputs PI(1)
to PI(k) are supplied to a matrix of KxS processing
elements 26 identified as PE1.1 to PEk.S in figure 3.
Output PI(1) is supplied to the inputs of each of the
processing elements in the row PE1.x, that is, elements
PE1.1, PE1.2 and PE1.S in figure 3. Output PI(2) is
supplied to each of the processing elements in the row
PE2.x, that is, elements PE2.1, PE2.2 and PE2.S, and so on,
so that output PI(k) is input to elements PEk.1, PEk.2, ...,
PEk.S as shown in figure 3.

The macroblocks B of the current frame are input on
an input IB to an input module 30 which receives them and
distributes the current frame macroblocks to S buffers B,
shown as 32(1) ... 32(S) in figure 3. The output of each
current frame macroblock buffer B is provided as an input
to each processing element in a column.

Thus, buffer B1 provides an input to processing
elements PE1.1, PE2.1 and PEk.1, and so on.

The outputs of each of the processing elements PE1.1
to PEk.S are provided as inputs to a row of S comparator
modules MIN 1 to MIN S identified by the numeral 34. As
with the current frame input buffers 32, the comparators
are connected to each processing element in a column of
the matrix. Thus, comparator MIN(1) receives at its input
the output of processing elements PE1.1, PE2.1 and PEk.1,
and so on. The comparators 34 process the inputs to
provide X,Y coordinates of matching anchor frame
macroblocks for given current frame macroblocks. The X,Y
coordinate is the upper left hand coordinate of the block.
The comparators then pass this coordinate data to the
output block 36.

It will be appreciated that data is input and output 
serially but all the processing is performed in parallel. 
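
The data flow of figure 3 can be sketched, purely by way of example, as a K x S array of difference sums in which each row shares one anchor frame macroblock and each column shares one current frame macroblock, followed by one minimum per column. The Python below is sequential and only models the arrangement of the results, not the parallel hardware; `block_sad` and `match_matrix` are hypothetical names and blocks are represented as lists of rows.

```python
def block_sad(cur_block, anc_block):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - a)
               for cur_row, anc_row in zip(cur_block, anc_block)
               for c, a in zip(cur_row, anc_row))


def match_matrix(cur_blocks, anc_blocks):
    """cur_blocks: S current frame blocks (one per column buffer B).
    anc_blocks: K (coordinate, block) pairs (one per input module row).
    Returns, for each of the S columns, (coordinate, sad) of the best row."""
    # Element [k][s] plays the role of processing element PE(k+1).(s+1).
    sads = [[block_sad(cur, anc) for cur in cur_blocks]
            for _, anc in anc_blocks]
    results = []
    for s in range(len(cur_blocks)):
        # Role of comparator MIN(s+1): smallest sum in column s.
        k_best = min(range(len(anc_blocks)), key=lambda k: sads[k][s])
        results.append((anc_blocks[k_best][0], sads[k_best][s]))
    return results


anchors = [((0, 0), [[1, 1], [1, 1]]),
           ((1, 0), [[5, 5], [5, 5]]),
           ((2, 0), [[9, 9], [9, 9]])]
currents = [[[5, 5], [5, 4]], [[9, 9], [9, 9]]]
matches = match_matrix(currents, anchors)
```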

Referring now to figure 4, one of the processing
elements 26 is shown in greater detail. The element PEa.b
has an input from current frame macroblock buffer Bb and
an output to comparator MINb. The element
comprises four identical parallel-pipelined processing
modules 40 shown as Pc, Pv, Ph and PA which each have an
output to a comparator MINP 42. Each of the
parallel-pipelined processing modules 40 receives as its inputs
the output PB from the column macroblock buffer, in this
case PBb, and an input PI from the row input module 20.
The input PI comprises four separate inputs Ic, Iv, Ih and
IA which are input respectively to processing modules Pc,
Pv, Ph and PA. The processing modules 40 perform parallel
comparison of a single macroblock of the current frame
provided from buffer B with four interpolations of a
macroblock of the anchor frame having coordinates c, v, h
and A as defined with reference to figure 2 earlier.
Thus, the comparison is made with an anchor block having a
given coordinate or coordinates offset by a half-pixel in
a horizontal, vertical or diagonal direction. It is the
inclusion of these four pipelined processors in each
processing element which gives the ability to estimate
motion to half-pixel accuracy.
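
A minimal sketch of one such processing element follows. It assumes that the four interpolated versions of the anchor block (A, h, v, c) have already been prepared by the input module, and that the half-pixel result is reported by adding 0.5 to the relevant coordinate. The name `processing_element` and the offsets table are hypothetical.

```python
# Half-pixel offset contributed by each of the four interpolated inputs.
HALF_PEL_OFFSETS = {"A": (0.0, 0.0), "h": (0.5, 0.0),
                    "v": (0.0, 0.5), "c": (0.5, 0.5)}


def processing_element(cur_block, candidates, x, y):
    """Compare one current frame block with the A, h, v and c versions of
    the anchor block at integer position (x, y); return (sad, coordinate)
    for the best of the four, mirroring the role of comparator MINP."""
    best = None
    for name in ("A", "h", "v", "c"):
        total = sum(abs(p - q)
                    for cur_row, anc_row in zip(cur_block, candidates[name])
                    for p, q in zip(cur_row, anc_row))
        if best is None or total < best[0]:
            dx, dy = HALF_PEL_OFFSETS[name]
            best = (total, (x + dx, y + dy))
    return best


candidates = {"A": [[10]], "h": [[15]], "v": [[20]], "c": [[25]]}
result = processing_element([[15]], candidates, 3, 4)
```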

Figure 5 shows an alternative processing element 26
(PEa.b) that is suitable where only single pixel precision
is required. It is identical to the element of figure 4
except that a single parallel-pipelined module 40 is
required which receives a single input PI from the input
module.

A parallel-pipelined module 40 is shown in more
detail in figure 6. The module comprises M blocks AD 50
operating in parallel, each of which receives as an input
the output from the column current frame macroblock buffer
together with an input I. The input I is provided from
the input module and will be described in greater detail
later. The output of each block AD 50 is passed to an
adder-accumulator 60 whose output is the input to
processor comparator MINP 42 in figures 4 and 5.

The AD units each carry out a series of arithmetic
operations on the incoming data. Thus, the units each
include a subtracter 51 which subtracts the value of the
current frame macroblock data from the anchor frame
macroblock data, an absolute value unit 52 which converts
the output of the subtracter to an absolute value, an
accumulating adder 54 which adds the absolute value to the
sum of earlier values, a first register 56 which holds the
output of the adder 54 and whose output is fed back to the
second input to the adder, and a second register 58 which
receives the output of the first register 56 and thus the
output of the accumulating adder. Thus, the blocks AD
calculate the sum of absolute values of M differences with
each block performing pipelined operation of sequential
devices. The adder-accumulator 60 receives the output of
each second register 58 of each pipeline as an input to a
multiplexer 62. The output of the multiplexer forms the
input to an accumulator-adder 64 whose output forms the
input to a first register 66 whose output is fed back to
the adder 64 to provide the second input. Thus, the outputs
from the registers 58 are summed and the output fed to a
second register 68 whose output is the input to the
comparator MINP 42.
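
The two-stage accumulation of figure 6 might be modelled, as a sketch only, by the Python below. It assumes that each of the M blocks AD is fed one column of N sample pairs, which the text does not state explicitly; the reference numerals in the comments indicate which hardware unit each line stands in for, and the function names are hypothetical.

```python
def ad_accumulate(cur_column, anc_column):
    """One block AD: subtract (51), take the absolute value (52) and
    accumulate (54, 56); the returned value is what register 58 would hold."""
    acc = 0
    for cur, anc in zip(cur_column, anc_column):
        acc += abs(anc - cur)
    return acc


def module_sad(cur_block, anc_block):
    """cur_block and anc_block each hold M columns of samples.
    Returns (partial sums of the M blocks AD, total from register 68)."""
    partial = [ad_accumulate(c, a) for c, a in zip(cur_block, anc_block)]
    total = 0
    for p in partial:   # multiplexer 62 feeding accumulator-adder 64
        total += p
    return partial, total


partial, total = module_sad([[1, 2], [3, 4], [5, 6]],
                            [[2, 2], [0, 4], [5, 9]])
```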

It will now be understood that the comparator MINP 42
of each processing element sequentially compares the sums
provided from each of the modules Pc, Pv, Ph, PA for the
current frame macroblock and, in its most simplistic form,
defines with half-pixel precision the coordinates of the
anchor frame macroblock which has the smallest partial
sum. It will be understood that the macroblock with the
smallest partial sum is that which corresponds most
closely to the current frame block under consideration.
In many applications it will be more appropriate to set a
threshold for the comparison. Higher thresholds may be
set. As the threshold increases so too does the
likelihood that there will be more than one coordinate
value which will reach that threshold value. In that case
the MPEG-2 standard provides that the decision may be made
on the basis either of the first macroblock within the
threshold value or the smallest value of all. If a
macroblock provides no coordinate value within the
threshold, as may be the case, for example, where there is
a scene change, that macroblock is intra-frame coded and
the remaining macroblocks are inter-frame coded. This
means that the bit rate reduction process is not abandoned
purely because one block cannot be matched.
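
The threshold decision described above can be illustrated with the following sketch. It assumes that "within the threshold" means a sum less than or equal to the threshold and that candidates arrive in search order; a return value of None stands for the case where the macroblock is intra-frame coded. The name `select_match` is hypothetical.

```python
def select_match(candidates, threshold, first_match=True):
    """candidates: list of (coordinate, sad) pairs in search order.
    Returns the chosen coordinate, or None if no candidate is within
    the threshold (the macroblock would then be intra-frame coded)."""
    within = [(coord, s) for coord, s in candidates if s <= threshold]
    if not within:
        return None
    if first_match:
        return within[0][0]                          # first within threshold
    return min(within, key=lambda cs: cs[1])[0]      # smallest of all


found = [((0, 0), 50), ((1, 0), 12), ((2, 0), 7)]
first = select_match(found, 20)
smallest = select_match(found, 20, first_match=False)
```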

It will be understood that the pipelines AD could be
implemented in a variety of other ways.

It will also be understood that it is ideal to
process the whole of the frame in parallel but this is not
necessary. The amount of the frame that is processed in
parallel will depend on the Level/Profile being used and
the environment in question. A procedure for optimising
the architecture of the processor is described later.

Turning now to figure 7, the input module I is shown
in more detail. The module comprises the anchor frame
buffer 70, shown as Memory Unit I in figure 7, and M
processing blocks 72 S1 to SM together with an adder 74
and a delay line 76. The anchor frame buffer 70 is
controlled by a control unit 78.

The purpose of the processing blocks 72 is to provide
from the input data the necessary additional data to
perform calculations with half-pixel precision. Thus, the
processing blocks S 72 provide the Ic, Iv, Ih, IA data
inputs to the parallel-pipelined processing modules 40 of
the processing elements. Again it will be understood that
if the embodiment of figure 5 is adopted without
half-pixel precision, the processing blocks of figure 7 are not
necessary.

Referring back to figure 2, four points A, h, v, c are
represented in the square. These points are required to
operate at half-pixel precision. Luminance data Y
corresponding to these points is the input to processing
modules 40 as mentioned above. Each of the blocks 72
comprises a delay L 80, an adder Sh 82 with delay Lh 84,
an adder Sv 86 with delays Lv1 88 and Lv2 90, and an adder Sc 92.
Adder Sh 82 performs the horizontal interpolation of
equation (2), being half the sum of luminance pixels A+B in
figure 2, and thus the delay 84 is of a length equal to the
pixel period. The output of adder 82 is the luminance
value at point h. Adder Sv performs the vertical
interpolation of equation (3), being half the sum of the
luminance pixels A+D in figure 2. Adder Sc 92 performs
the central interpolation of equation (4) to calculate the
luminance at point c in figure 2. Delays L, Lh and Lv1 all
provide timing adjustment for data output on the bus PI.
As can be seen from figure 7, the outputs Ic, Iv, Ih and
IA are comprised of lines Ic1, Ic2, ..., IcM etc., with one line
being provided by each of the blocks S1, S2, ..., SM.

Summarising the above, the input module takes the
anchor frame data and forms the A, h, v and c data for each
of M inputs. The A value is a simple delayed version of
the input whereas h, v and c are obtained by performing
equations (2), (3) and (4) as described in relation to
figure 2.

The additional adder Sv 74 and delay Lv1 76 shown in
figure 7 are required as the value h relative to the last
pixel A to be calculated requires knowledge of the next
pixel B. This is provided by output M+1 from the buffer
70.
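
The need for the extra output M+1 can be seen from a short sketch: M horizontal half-pixel values require M+1 adjacent pixels, since the last value h depends on the pixel to its right. The function name is hypothetical and plain division is used without rounding.

```python
def horizontal_half_pixels(row):
    """Given M+1 adjacent luminance samples, return the M values h of
    equation (2), one for each pixel A that has a right-hand neighbour B."""
    return [(row[i] + row[i + 1]) / 2 for i in range(len(row) - 1)]


# Three samples (M+1 = 3) yield two half-pixel values (M = 2).
h_values = horizontal_half_pixels([10, 20, 40])
```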

Figure 8 shows the input buffer 70 of the input
module in more detail. Data inputs Ih, Iv are provided to
first and second data registers 100, 102. Data from these
registers is transferred to a multiplexer 104 according to
an anchor frame data priming algorithm which will be
described. The multiplexer outputs data to a plurality of
M+1 two-part memory blocks I1 to IM+1 106 which store M+1
columns of anchor frame data. The output of the
multiplexer and the memory blocks 106 are both controlled
by signals AR, AWT, AWF from the control unit 78 (figure
7). Data is output from the memory blocks to a switch
matrix MXI.1 to MXI.M+1 108 having M+1 inputs and M+1
outputs. The output of the switch matrix is the M+1 lines
to the M processing blocks S of figure 7.

The control unit 78 in figure 7 operates according to
the anchor frame data priming algorithm and generates the
anchor frame macroblock coordinates which are sent to the
processing elements 26 for processing.

Referring back to figure 3, the current frame
macroblock buffer 30 comprises M memory blocks with N
cells. The organisation of the buffer 30 enables
simultaneous storage of current frame macroblocks and the
reading and loading of the next macroblock of the current
frame. The memory blocks and registers 32 receive data
serially. The organisation of the current frame input
buffer is illustrated in figures 9 and 10.

In figure 10 it will be seen that the input B data is
passed to the input B unit register B and a demultiplexer,
the output of which passes the data to the buffers B1 to
BS. As can be seen from figure 9, each of the B buffers
comprises a series of memory blocks 1 to M, each having N
cells which are duplicated, and the blocks have outputs
to a respective one of M multiplexers whose outputs are
passed to the processing elements of a given column.

The comparator modules 34 MIN1 to MINS of figure 3
sequentially compare the partial sums from parts PE1.i to
PEk.i and define the coordinates of the anchor frame
macroblock for which the threshold criteria are achieved.
These coordinates are passed to the output block 36 for
output.

Data loading algorithm



Figures 11 to 15 show the steps in the anchor frame
priming process to generate the macroblock coordinates.
Figure 11 is an overview of the process and figures 12, 13
and 14 show, respectively, the READ F, WRITE T and WRITE F
steps in more detail. Figure 15 is a schematic
representation of an anchor frame.

Referring first to figure 15, anchor frame 200 having
dimensions AxC is divided into K partial cross stripes
202a...k with dimensions Axd, where d = ((C-N)/K + N), C is
the vertical frame dimension, and K is the number of
modules Input I. So, for instance, for frame dimensions of
352x288, K=4 and N=16, the frame is divided into 4
stripes each of dimensions 352x84. The first stripe 202a
with upper left corner coordinates (1,1) will be loaded and
processed in module Input I1. The second stripe 202b with
coordinates (1,68) will be loaded in module Input I2, the
third stripe 202c with coordinates (1,136) will be loaded
in module Input I3 and the fourth stripe 202k with coordinates
(1,204) will be loaded in module Input I4. The stripes are
loaded in sequence. All stripes are processed in parallel
and in the same manner.
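
The stripe arithmetic can be checked with a short sketch. It assumes that (C-N) is exactly divisible by K and reports the vertical start of each stripe as a zero-based offset, which reproduces the values 68, 136 and 204 quoted above for the second to fourth stripes (the first stripe is quoted above with a one-based coordinate). The name `stripe_layout` is hypothetical.

```python
def stripe_layout(c, n, k):
    """Return (d, offsets): the stripe height d = (C - N) / K + N and the
    zero-based vertical offset of each of the K stripes."""
    step = (c - n) // k          # vertical distance between stripe starts
    d = step + n                 # adjacent stripes share N rows
    offsets = [i * step for i in range(k)]
    return d, offsets


d, offsets = stripe_layout(288, 16, 4)
```

Because consecutive stripes start 68 rows apart but are 84 rows high, each stripe shares N = 16 rows with its neighbour, so that a macroblock straddling a stripe boundary can still be matched.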



Before describing the loading algorithm, the
following terms will be defined: field F and column T.
Field F 204 is the part of a stripe that represents a
matrix of numbers with the dimensions (M+1) x d. Column T 206
is the part of a stripe that represents a matrix of numbers
with the dimensions 1 x d.

Each of the memory modules I1, I2, ..., IM+1 (Fig. 8)
comprises two banks each having a volume d, one of which
is used for the processing, the current operational bank,
and the other of which is used for loading the next portion of
data. Field F is loaded in the bank that is currently used
for loading. Each column T of the field F is loaded in the
corresponding memory module. This operation is denoted
Write F (field load) and is shown in figure 14.



The algorithm for the Write F operation provides
sequential loading of columns T of field F in
corresponding memory modules. In each memory module,
column T is loaded sequentially according to the address
AWF value.



After the field F is loaded in the first memory bank,
the data in this bank is ready for processing. The field F
of the next anchor frame will be loaded further in the
second memory bank. In the operational memory bank two
operations are performed: the field F read operation
denoted Read F and the column loading operation denoted
Write T. These two operations are illustrated in figures
12 and 13 respectively.



The Read F operation represents the sequence of M+1
simultaneous operand read operations from M+1 memory
modules according to the common address AR. The initial
address AR is equal to zero. After N read operations the
initial address increments by one and the next N read
operations are performed, and so on until the initial
address becomes greater than d-N.
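
One reading of this address sequence is sketched below. It assumes that, within each group of N read operations, the common address advances by one from the current initial address; the text does not spell this out, so the sketch is an interpretation. The name `read_f_addresses` is hypothetical.

```python
def read_f_addresses(d, n):
    """Yield the common read address AR applied to all M+1 memory modules:
    N consecutive addresses from each initial address 0, 1, ..., d - N."""
    start = 0
    while start <= d - n:        # stop once the initial address exceeds d - N
        for offset in range(n):
            yield start + offset
        start += 1


addresses = list(read_f_addresses(4, 2))
```

Under this reading a stripe of height d = 84 with N = 16 gives 69 vertical block positions, each read with 16 operations.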






After the Read F operation, the data of the left column
can be replaced with the data of the column immediately to
the right of the rightmost column of field F. For instance,
after the Write F procedure, the operational memory bank holds
the data of the columns with coordinates Y=1, ..., Y=M+1. After the
first Read F operation, the data of the first column with
coordinate Y=1 can be replaced with the data of the next
column to the right, with coordinate Y=M+2. This process is
performed sequentially for the whole stripe.



Referring to figure 13, the Write T algorithm for the
column T loading operates as follows. Firstly the
coordinate Y is incremented by one and the new value is
compared with the C value (frame vertical dimension). If Y
< C, the column loading operation continues. The write address
AWT is calculated and then the AWT value is compared with
the value of the current read address AR (from the Read F
operation). This comparison is necessary because read and
write operations are performed from and to the same memory
bank and the Read F operation should provide the correct
column T data reading. If AWT < AR and there is a ready
signal from register Rin1, the data is loaded to the
address AWT, and so on until j<d.

The whole loading algorithm for the loading of one
stripe of the reference frame is shown in figure 11.
Firstly field F is loaded in memory through the Write F
operation. This operation is synchronized by a ready
signal from register Rin2. The finish of this operation is
synchronized with the end of loading of S current frame
macroblocks in module Input B.



Then for each stripe the initial coordinates are set:
X=Xf and Y=1. Matrix switch 108 (see Fig. 8) provides
direct data transfer (MX=0). The address of the column
being loaded is set to one: T=1. Then three parallel
processes are performed in the operational bank:
Write F, Read F and Write T. The last two processes are
synchronized by the read address AR. The end of these
processes is also synchronized. If Write T is not
outputting the signal END Y, then X=Xf, the column loading
address is incremented by one (T=(T+1) mod (M+1)), the matrix
switch 108 is switched to transfer data according to the
column T address (MX=(MX+1) mod (M+1)) and the two parallel
processes continue until the signal END Y
appears. Then the algorithm waits for the end of field
loading (Write F operation) and for the end of loading
of S current frame macroblocks, and so on.



In summary, the embodiment described provides parallel
processing of calculations, anchor and current frame data
input and motion vector output through a matrix of
processing elements, input modules for the anchor frame
and current frame data and an output module for the motion
vectors. Motion vectors are calculated in parallel for a
set of current frame macroblocks and, preferably, to a
half pixel precision. Furthermore, M sums of absolute
difference are calculated in parallel in the processing
elements and a single macroblock row of 16 pixels is
processed in parallel. Pipeline processing is provided
for in the calculation of the sum of absolute values of
differences, the summing of those sums and the comparison
of those sums to determine the closest anchor frame
macroblock.
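
The block matching that the processing elements carry out in hardware can be illustrated in software as follows. This is a sketch only: the function names sad and best_match are hypothetical, the search is an exhaustive integer-pixel search over the whole reference area, and neither the half pixel refinement nor the parallel-pipelined organisation of the embodiment is modelled.

```python
def sad(cur, ref, x, y):
    """Sum of absolute differences between the current block `cur` and
    the equally sized block of `ref` whose top-left corner is (x, y)."""
    return sum(
        abs(cur[r][c] - ref[y + r][x + c])
        for r in range(len(cur))
        for c in range(len(cur[0]))
    )


def best_match(cur, ref):
    """Return (sad, x, y) for the reference block closest to `cur`.

    Sequential stand-in for the comparison the hardware performs in
    parallel; ties are resolved in favour of the first position found.
    """
    n, m = len(cur), len(cur[0])          # block height and width
    best = None
    for y in range(len(ref) - n + 1):
        for x in range(len(ref[0]) - m + 1):
            s = sad(cur, ref, x, y)
            if best is None or s < best[0]:
                best = (s, x, y)
    return best


if __name__ == "__main__":
    reference = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    current = [[9, 10], [13, 14]]
    print(best_match(current, reference))
```

The coordinates returned, taken relative to the position of the current block, correspond to the motion vector for that block.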



The embodiment has been described with reference to
forward predicted coding. It will be appreciated that it
is equally applicable to bidirectional coding. The latter
is achieved by performing the comparison operation for the
current frame twice, once with the forward anchor frame
and once with the backward anchor frame, and then comparing
the results of the two. The best of the two is then taken
as the predicted frame.

It will also be appreciated that the motion estimator
can operate on a whole frame of current macroblocks or,
where the number of blocks is too high, can process the
frame in a number of passes. An alternative would be to
use two or more processors; however, there is adequate time
for at least two passes.



The motion estimator described herein may be used in
any environment in which MPEG-2 coding is required. This
includes, for example, video signal encoding for broadcast
or broadcast quality pictures for subsequent narrowcast or
recording, multipoint tele- or video conferencing
equipment, DVD video encoders, and video cameras including
broadcast quality cameras and camcorders. For
applications such as multipoint teleconferencing, it is
not practical for the search to be based on a full anchor
frame and it is suitable to define a search window. As the
amount of movement is likely to be small, it is believed
that this approach is satisfactory and can give very
significant improvements over presently available systems,
enabling rates of up to 15 frames per second on
conventional ISDN links with a data rate of 128 kbit/s. In
other applications the statistical approach of the whole
frame search is more appropriate. It will be understood
that the estimator as described affords the possibility of
either solution, depending on the application.

Figures 16 to 20 show examples of how the
embodiment of the invention described can be used in a
variety of different applications, each using MPEG based
video compression.

In figure 16, there is illustrated an MPEG processor
248 which is the core part of all the applications. The
MPEG processor comprises a programmable DSP engine 250 to
support the basic functions of MPEG video coding and
compression and decoding and decompression including DCT,
IDCT, Q, Q^-1, VL coding and so on. The Motion Detection
Processor (MDP) 252 is a parallel-pipelined processor embodying
the present invention. The complexity of the MDP engine
252 will depend on the demands of the real-time video
sequences being processed for any particular MPEG level
and profile. Computational performance of the DSP engine
250 should also be consistent with the particular
application.



The MPEG processor proposed can be implemented using
existing DSP processors, for example the TMS320C62 DSP
processor. Thus it is necessary only to develop the MDP.
This two chip solution can be used for the lower MPEG
profiles and levels. For higher MPEG levels and profiles
it may be necessary to develop a more powerful DSP engine.
It is possible to develop a single chip solution for the
MPEG processor due to its general structure as outlined
above. In the case of a single chip solution, the
processor will have one input data bus and a single
interface to the external RAM.



Figure 17 illustrates how an embodiment of the
present invention may be used in a video conference system.
At present, videoconferencing systems are being developed
mainly on a PC platform. The embodiment of figure 17
frees the Pentium (or other) PC processor from the heavy
computational task of determining motion vectors. In
figure 17, the system controller 260 communicates with a
PCI bus through a PCI interface 262, and with an MPEG
processor 264 as illustrated in figure 16 and embodying
the present invention over the system bus. The MPEG
processor is coupled to a RAM 266 with which it can
exchange data. Depending on the choice of video and audio
front-end devices 268, 270, and the MPEG processor
realisation, the front end devices may either be attached
to the system bus (solid line in figure 17), or connected
through the system controller 260 (shown as dotted lines
in figure 17).

The MPEG processor encodes digital video and audio 
data from the front end devices. The MPEG data stream is 
output through the system controller and the PCI bus and 
can be further transported to the destination through the 
communications capabilities of the PC. 

The MPEG processor 264 also decodes incoming audio
and video data which is received as an MPEG data stream on 
the PCI bus. Decompressed audio and video data is further 
available to the user through the PCI bus and the 
corresponding PC capabilities such as the monitor and 
sound blaster. 

The use of a PC or other computer with the MPEG 
acceleration board will allow multi-point 
videoconferencing systems to be built by using the 
computational resources of existing processors such as the 
Pentium and Pentium II (TM) to decode additional input 
MPEG channels. 

The system outlined above is suitable for a number of
videoconferencing systems such as point-to-point QCIF
videoconferencing, multipoint QCIF videoconferencing and
low-bit CIF videoconferencing on ISDN lines. The
processing of audio data is optional and may be performed
using PC software or by the DSP engine.



A DVD system embodying the present invention has the
same architecture as shown in figure 17. Differences may
exist in the MPEG processor due to the need for compression
conforming to the CCIR Rec. 601 standard. To provide the
corresponding MPEG level and profile, a more complex MDP
engine is required. As the system is intended only to
compress video and audio and to write an MPEG stream on
DVD ROM through the PC's capabilities, the increase in DSP
complexity, if any, may be negligible.



Figure 18 shows how an MPEG processor embodying the
present invention and as shown in figure 16 may be used in
a videophone system. The system is based on a QCIF
videoconferencing system and is similar to the system
illustrated in figure 17 except that it requires audio and
video back end devices 272, 274 which provide digital to
analog conversion of decompressed MPEG data. In addition
the system controller interface must include a modem
interface 276 for exchange of digital MPEG data between
the transmitting and receiving points. In this system,
audio data processing is necessary.

Figure 19 shows how an MPEG processor embodying the
present invention and illustrated in figure 16 may be
used, in conjunction with DVD technology for MPEG data
storage, to develop a digital video camera. This
realisation relies on the availability of rewritable
DVD-ROMs with sufficiently good speed characteristics. The
arrangement is similar to that of figure 18 except that
the audio and video back end devices are optional, being
needed only if playback is required, that a DVD controller
278 communicates with the system controller and that no
modem is needed.



Figure 20 shows an example of how the MPEG encoder
embodying the invention and illustrated in figure 16 may
be used as a television MPEG encoder. The circuit
illustrated may be used in broadcasting equipment to
encode a single television channel. The same
configurations may be used for standard definition and
HDTV, the difference being in the complexity of the
MPEG processors. Present fabrication techniques can build
a processor for standard definition on a single chip. At
present several chips operating in parallel are required
to support HDTV, although it is envisaged that a single chip
solution will be possible shortly as fabrication
techniques improve.

Procedure for the architecture parameters definition 

It will be appreciated that for different MPEG levels
and profiles and for different applications, varying sizes
of processing matrix will be required. The following
section sets out how the parameters of the architecture
may be defined.

For real-time motion vector calculation it is
necessary to define the following architecture parameters:

Number of processing elements - K*S;

Number of input ports for module Input B - LB;

Number of input ports for module Input I - LIv and LIh;

Number of memory module cells - D;

Number of output ports for module Output V - LV;

Number of processing elements in the horizontal direction - S;

Number of processing elements in the vertical direction - K.



The architecture parameters depend on the values of
the following primary data:

A - frame horizontal dimension;

C - frame vertical dimension;

p - number of bits for pixel representation;

MxN - macroblock dimensions;

Tc - time interval for a single operation on a pixel in
the pipeline and memory read time interval;

Tio - time interval for the external input/output of a
single information bit;

T - time interval for the calculation of the
motion vectors for the full current frame;

Lmax - maximal number of input/output ports.

Calculation of K*S value 

The K*S matrix dimensions necessary for real-time
operation, that is the total number of processing
elements, are calculated from the following expression:

$$T = \frac{(A/M)\,(C/N)\,(A-M)\,(C-N)\,N\,T_c}{K \cdot S} \qquad (1)$$

This expression means that during time interval T it
is necessary to perform block matching procedures on
A/M*C/N current frame macroblocks with (A-M)*(C-N) anchor
frame macroblocks using a matrix of K*S parallel processing
elements. The block matching procedure for two macroblocks
requires time interval N*Tc, as only the read operation of
the N macroblock rows is performed sequentially and all other
operations necessary for the block matching procedure are
performed in parallel-pipelined mode.

Therefore, the value of K*S is calculated from expression (1):

$$K \cdot S \ge \frac{A\,C\,(A-M)\,(C-N)}{M}\cdot\frac{T_c}{T} \qquad (2)$$

The maximal value of S is defined by the number of
anchor frame macroblocks:

$$S_{max} = \frac{A}{M}\cdot\frac{C}{N} \qquad (3)$$

The minimal value of K in this case is:

$$K_{min} = \frac{K \cdot S}{S_{max}} = (A-M)\,(C-N)\,N\,\frac{T_c}{T} \qquad (4)$$
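
Expressions (2) to (4) can be evaluated numerically as in the sketch below. The function name pe_matrix_bounds is hypothetical, and the values in the example call (a 352 x 288 frame and Tc = 10 ns) are illustrative assumptions only; the text does not give a value for Tc.

```python
def pe_matrix_bounds(a, c, m, n, tc, t):
    """Evaluate expressions (2), (3) and (4) as read from the text.

    Returns (ks, s_max, k_min): the lower bound on the number of
    processing elements K*S, the maximal S and the minimal K.
    """
    ks = a * c * (a - m) * (c - n) / m * (tc / t)   # expression (2)
    s_max = (a / m) * (c / n)                        # expression (3)
    k_min = ks / s_max                               # expression (4)
    return ks, s_max, k_min


if __name__ == "__main__":
    # Illustrative values only: frame size and Tc are assumptions.
    ks, s_max, k_min = pe_matrix_bounds(a=352, c=288, m=16, n=16,
                                        tc=10e-9, t=0.0166)
    print(f"K*S >= {ks:.1f}, Smax = {s_max:.0f}, Kmin = {k_min:.3f}")
```

In practice K and S must be integers; the rounding to be applied is not addressed in the text and is left out of the sketch.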

Calculation of Input/Output ports number 

The value of LB is calculated from the following
expression:

$$T \ge A\,C\,p\,\frac{T_{io}}{\mathrm{LB}} \qquad (5)$$

This expression means that during time interval T it is
necessary to load the whole of the current frame data
into the processor. So, the value of LB is:

$$\mathrm{LB} \ge A\,C\,p\,\frac{T_{io}}{T} \qquad (6)$$

The value of LV is calculated from the following
expression:

$$T \ge 2\,\frac{A}{M}\cdot\frac{C}{N}\,\log_2 A\;\frac{T_{io}}{\mathrm{LV}} \qquad (7)$$

This expression means that during time interval T it
is necessary to output the X,Y coordinates of all calculated
motion vectors for the current frame. So, the value of LV is:

$$\mathrm{LV} \ge 2\,\frac{A}{M}\cdot\frac{C}{N}\,\log_2 A\;\frac{T_{io}}{T} \qquad (8)$$

The value of LIv is calculated from the following
expression:

$$T \ge \frac{(M+1)\,p\,A\,C^{2}}{S\,M\,N}\cdot\frac{T_{io}}{\mathrm{LIv}} \qquad (9)$$

This expression means that during time interval T it
is necessary to load a memory volume equal to (M+1)*p*C
(A*C)/(S*M*N) times. So, the value of LIv is:

$$\mathrm{LIv} \ge \frac{(M+1)\,p\,A\,C^{2}}{S\,M\,N}\cdot\frac{T_{io}}{T} \qquad (10)$$

The value of LIh is calculated from the following
expression:

$$T \ge \frac{(A-M-1)\,p\,C^{2}\,A}{S\,M\,N}\cdot\frac{T_{io}}{\mathrm{LIh}} \qquad (11)$$

This expression means that during time interval T it
is necessary to load a memory volume equal to (A-M-1)*p*C
(A*C)/(S*M*N) times. So, the value of LIh is:

$$\mathrm{LIh} \ge \frac{(A-M-1)\,p\,C^{2}\,A}{S\,M\,N}\cdot\frac{T_{io}}{T} \qquad (12)$$
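
The four port counts can be evaluated together as in the sketch below. It follows the verbal explanations given for each expression (whole current frame loaded once, two coordinates of log2 A bits per motion vector, and memory volumes of (M+1)*p*C and (A-M-1)*p*C loaded (A*C)/(S*M*N) times). The function name io_port_counts is hypothetical, and the lower bounds are returned without rounding.

```python
import math


def io_port_counts(a, c, p, m, n, s, tio, t):
    """Lower bounds on LB, LV, LIv and LIh from expressions (6), (8),
    (10) and (12), as read from the accompanying explanations."""
    loads = (a * c) / (s * m * n)            # number of stripe loads per frame
    return {
        "LB": a * c * p * (tio / t),                                  # (6)
        "LV": 2 * (a / m) * (c / n) * math.log2(a) * (tio / t),       # (8)
        "LIv": (m + 1) * p * c * loads * (tio / t),                   # (10)
        "LIh": (a - m - 1) * p * c * loads * (tio / t),               # (12)
    }


if __name__ == "__main__":
    # Small illustrative values chosen only to keep the numbers readable.
    print(io_port_counts(a=32, c=32, p=8, m=16, n=16, s=1, tio=1.0, t=1.0))
```

Each value is a minimum; an implementation would round it up to a whole number of ports.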

Calculation of D value

D is the length of the column that is loaded in the K Input I
modules. The D value can be calculated from the following
expression:

$$D\,p\,\frac{T_{io}}{\mathrm{LIh}} \le \frac{D-N}{K}\,N\,T_c \qquad (13)$$

This expression means that during the time interval for
loading a column of length D it is
necessary to load (D-N)/K*N operands. Therefore, D is
calculated from the following expression:

$$D = \frac{N}{1-\dfrac{p\,T_{io}}{N\,T_c}\cdot\dfrac{K}{\mathrm{LIh}}} \qquad (14)$$

Using expression (12) for LIh, the final expression
for the D value is:

$$D = C\,\frac{1-\dfrac{1}{A-M}}{1-\dfrac{1}{A-M}\cdot\dfrac{C}{N}} \qquad (15)$$

Since the value of (1-1/(A-M))/(1-1/(A-M)*C/N) is
greater than 1, D>C, which contradicts the loading
algorithm. So, it is necessary to choose D=C. In this case
expression (12) becomes:

$$\mathrm{LIh} \ge \frac{(A-M)\,p\,C^{2}\,A}{S\,M\,N}\cdot\frac{T_{io}}{T} \qquad (16)$$
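
The step from expression (14) to expression (15) is not spelled out in the text. One possible reconstruction is given below; it assumes that LIh is taken at its lower bound from (12) and that K*S is taken at its lower bound from (2).

```latex
\begin{align*}
\frac{p\,T_{io}}{N\,T_c}\cdot\frac{K}{\mathrm{LIh}}
  &= \frac{p\,T_{io}}{N\,T_c}\cdot
     \frac{K\,S\,M\,N\,T}{(A-M-1)\,p\,C^{2}\,A\,T_{io}}
   = \frac{K S\,M\,T}{(A-M-1)\,C^{2}\,A\,T_c} \\
  &= \frac{(A-M)(C-N)}{(A-M-1)\,C}
   \qquad\text{using } K S = \frac{A\,C\,(A-M)(C-N)}{M}\cdot\frac{T_c}{T}, \\
D &= \frac{N}{1-\dfrac{(A-M)(C-N)}{(A-M-1)\,C}}
   = \frac{N\,(A-M-1)\,C}{(A-M)\,N - C}
   = C\,\frac{1-\dfrac{1}{A-M}}{1-\dfrac{1}{A-M}\cdot\dfrac{C}{N}} .
\end{align*}
```

Provided that A-M exceeds C/N, both the numerator and the denominator of the last fraction are positive, and since C/N is greater than 1 for any frame taller than one macroblock, the denominator is the smaller of the two. This gives D > C as stated.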

From expressions (10) and (16) it is possible to define
Smin under the restriction on the number of Input/Output ports:

$$S_{min} = \frac{(A+1)\,A\,C^{2}\,p}{M\,N}\cdot\frac{T_{io}}{T}\cdot\frac{1}{L_{max}-\mathrm{LB}-\mathrm{LV}} \qquad (17)$$

Calculation of K and S 

By increasing the S value it is possible to decrease
the number of Input/Output ports. On the other hand,
increasing the K value can reduce the amount of
processor hardware.

Suppose:

H - hardware necessary for the PE implementation
according to Fig. 4;

gs*H - hardware necessary for the Input B module
implementation;

gk*H - hardware necessary for the Input I module
implementation.

Coefficients gs and gk depend on the particular module
implementations.

So, the total hardware for the implementation of the
processor for motion vector calculation according to
Fig. 3 can be minimized using the following expression:

$$K \cdot S \cdot H + \frac{K \cdot S}{K}\,g_s\,H + K\,g_k\,H \;\to\; \min \qquad (18)$$

In order to minimize the hardware it is necessary to
differentiate expression (18) with respect to K and to equate
the result to zero. In this case the optimal value Kopt is equal to:

$$K_{opt} = \left(K \cdot S\,\frac{g_s}{g_k}\right)^{1/2} = \left(\frac{A\,C\,(A-M)\,(C-N)}{M}\cdot\frac{T_c}{T}\cdot\frac{g_s}{g_k}\right)^{1/2} \qquad (19)$$
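
The differentiation step can be written out as follows. This is a sketch under two assumptions: the total number of processing elements P = K*S is held fixed at the value given by (2), so that S = P/K, and the term for the processing elements themselves does not depend on K. The symbol W for the total hardware is introduced here for illustration only.

```latex
\begin{align*}
W(K) &= P\,H + \frac{P}{K}\,g_s\,H + K\,g_k\,H, \qquad P = K\cdot S,\\
\frac{dW}{dK} &= -\frac{P\,g_s\,H}{K^{2}} + g_k\,H = 0
\;\Longrightarrow\; K_{opt} = \sqrt{\frac{P\,g_s}{g_k}},\\
\frac{d^{2}W}{dK^{2}} &= \frac{2\,P\,g_s\,H}{K^{3}} > 0 .
\end{align*}
```

The positive second derivative confirms that the stationary point is a minimum of the hardware cost.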

If Kopt > Kmin then K=Kopt; otherwise K=Kmin, and S is
calculated from (2):

$$S = \frac{A\,C\,(A-M)\,(C-N)}{M}\cdot\frac{T_c}{T}\cdot\frac{1}{K} \qquad (20)$$

Where restrictions on the number of Input/Output ports
apply, it is further necessary to perform the following
final calculations:

If S ≥ Smin then S is left unchanged; otherwise S=Smin and K should be
recalculated from (2):

$$K = \frac{A\,C\,(A-M)\,(C-N)}{M}\cdot\frac{T_c}{T}\cdot\frac{1}{S} \qquad (21)$$
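
The selection of K and S set out in expressions (19) to (21) can be summarised in a short sketch. The function name choose_k_s and its argument names are hypothetical; ks stands for the required product K*S from (2), and the results are left as real numbers because the text does not say how they are rounded.

```python
import math


def choose_k_s(ks, k_min, gs, gk, s_min=None):
    """Pick K and S for a required number of processing elements `ks`.

    Follows the procedure of expressions (19)-(21): use the
    hardware-optimal K unless it does not exceed Kmin, then raise S
    to Smin if an I/O port restriction applies.
    """
    k_opt = math.sqrt(ks * gs / gk)          # expression (19)
    k = k_opt if k_opt > k_min else k_min
    s = ks / k                               # expression (20)
    if s_min is not None and s < s_min:
        s = s_min                            # port restriction applies
        k = ks / s                           # expression (21)
    return k, s


if __name__ == "__main__":
    # gs and gk are the values quoted for Table 2; ks, k_min and s_min
    # are illustrative assumptions.
    print(choose_k_s(ks=600.0, k_min=4.0, gs=0.2, gk=1.2))
    print(choose_k_s(ks=600.0, k_min=4.0, gs=0.2, gk=1.2, s_min=100.0))
```

The second call shows the case where the port restriction overrides the hardware-optimal choice.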

Table 2 below shows the results of applying
the optimization procedure to various video formats. In
all calculations the following initial parameters were
used:

p=8; 

N=16; 

M=16; 

T=0.0166 sec; 
gs=0.2; 
gk=1.2; 
D=C.

Variations and modifications to the embodiments described are
possible without departing from the invention and will occur
to those skilled in the art. However, the invention is defined
solely by the claims appended hereto.

CLAIMS

1. A parallel processor for estimating motion of a given
portion of a current image frame with reference to an
anchor frame comprising:

an input for receiving current frame data;

an input for receiving anchor frame data;

a two-dimensional matrix of processing elements each
for comparing a given area of the current frame with at
least an area of the anchor frame wherein the matrix
simultaneously compares S areas of the current frame with
nK areas of the anchor frame, the matrix having dimensions
of KxS and n being an integer;

means for selecting from the comparison, for each
area of the current frame, an area of the anchor frame
corresponding to the area of the current frame; and

means for outputting data identifying the selected
areas of the anchor frame.

2. A parallel processor according to claim 1, wherein
the matrix simultaneously compares S areas of the current
frame with 4K areas of the anchor frame.

3. A parallel processor according to claim 1 or 2 
wherein the areas of the anchor frame and the current 
frame are all cosized macroblocks. 

4. A parallel processor according to claim 3, wherein
the macroblocks comprise 16x16 pixels.

5. A parallel processor according to claim 4, wherein 
the pixels are luminance pixels. 

6. A parallel processor according to any preceding
claim, wherein each processing element comprises a
comparator and at least one parallel pipeline processor,
wherein the at least one parallel pipeline processor
receives current frame image area data and anchor frame
image area data and outputs a sum of absolute differences
between the current frame image area data and the anchor
frame image area data to the comparator.

7. A parallel processor according to claim 6, wherein
the parallel pipeline processor comprises a plurality of
pipeline stages and a pipeline accumulating adder for
adding the outputs of each of the pipeline stages.

8. A parallel processor according to claim 1, wherein
each of the pipeline stages comprises a subtracter for
providing a differential output from anchor and current
frame data inputs, an absolute value calculator, an
accumulator adder for adding calculated absolute values
and first and second registers for holding the accumulated
absolute values.

9. A parallel processor according to claim 8, wherein 
the pipeline accumulating adder sums the outputs of the 
second registers of each pipeline stage. 

10. A parallel processor according to claim 7, 8 or 9,
wherein the accumulating adder comprises a multiplexer for
receiving data inputs from the pipeline stages, an adder
for summing data inputs, a first register for holding the
output of the adder, wherein the adder receives as a
further input the content of the register, and a further
register for receiving the output of the first register
for output to the comparator of the processing element.

11. A parallel processor according to any of claims 6 to
10, wherein each processing element comprises four
parallel pipeline processors, the outputs of which are
input to the comparator, wherein the four parallel
pipeline processors perform parallel comparison of a
single area of the current frame with four areas of the
anchor frame separated vertically and/or horizontally by
half a pixel.

12. A parallel processor according to any preceding
claim, wherein the anchor frame data input comprises an
anchor frame buffer and a plurality of parallel processing
blocks for processing simultaneously pixels of a row of
the frame area, and a control unit.

13. A parallel processor according to claim 12, wherein 
each parallel processing block comprises a first means for 
generating a value of a pixel at a position offset 
horizontally half a pixel from an input pixel position, a 
second means for generating a value of a pixel at a 
position offset vertically half a pixel from said input 
pixel position, and a third means for generating a value 
of a pixel at a position offset vertically and 
horizontally half a pixel from said input pixel position. 

14. A parallel processor according to claim 13, wherein 
said first means comprises an adder and a first delay 
means and performs the function h = (A+B)/2 where h is the
half pixel offset value and A and B are horizontally 
adjacent input pixels. 

15. A parallel processor according to claims 13 or 14, 
wherein said second means comprises an adder and a delay 
means and performs the function v = (A+D)/2 where v is the
half pixel offset value and A and D are vertically 
adjacent input pixels. 

16. A parallel processor according to claims 13, 14 or
15, wherein said third means comprises an adder and
performs the function c = (A+B+D+E)/4 where c is the value
of the offset pixel and A, B, D and E are horizontally and
vertically adjacent pixels.

17. A parallel processor according to any of claims 12 to
16, wherein the control unit controls the anchor frame
buffer and outputs reference area coordinate values to the
processing elements.

18. A parallel processor according to any of claims 12 to
17, wherein the anchor frame buffer comprises M+1 memory
blocks, where M is a dimension of the anchor frame area,
and a switch matrix receiving data input from the memory
blocks.

19. A parallel processor according to any preceding
claim, wherein the anchor frame data input comprises an
input module for receiving and distributing input data,
and S current frame area buffers which receive current
frame area data from the input module.

20. A parallel processor according to claim 19, wherein
the input module comprises a plurality of memory blocks,
each block comprising a pair of memory banks each having a
plurality of memory cells.

21. A parallel processor according to claim 20, wherein
the number of memory blocks in the anchor data input
module is M and the number of cells in each memory block
is N where MxN is the dimension of the current frame area.

22. A parallel processor according to any preceding
claim, wherein the selecting means comprises S
comparators, said comparators also defining the
coordinates of the anchor frame areas corresponding to a
given current frame area.

23. A parallel processor according to any preceding claim
wherein n is 1 or 4.

24. A video processor comprising a programmable DSP
engine and a motion detection processor, wherein the
motion detection processor comprises a parallel processor
according to any preceding claim.

25. A video processor according to claim 24, wherein the
video processor is an MPEG processor.

26. A video encoder comprising a video processor 
according to claim 24 or 25. 

27. A video encoder according to claim 26, further
comprising a system controller communicating with the
video processor, a random access memory communicating with
the video processor, an audio front end and a video front
end, wherein the audio and video front ends communicate
with the video processor either via the system bus or via
the controller.

28. A multipoint teleconferencing apparatus comprising a 
video processor according to claim 24 or 25. 

29. A multipoint teleconferencing apparatus according to
claim 28, further comprising a system controller
communicating with the video processor, a further
processor, an interface between the system controller and
the processor, a random access memory communicating with
the video processor, and a video front end, wherein the
video front end communicates with the video processor
either via the system bus or via the controller.

30. A multipoint teleconferencing apparatus according to 
claim 29, wherein the further processor is a PC. 

31. A DVD system comprising a video processor according
to claim 24 or 25.

32. A DVD system according to claim 31, further
comprising a system controller communicating with the
video processor, a further processor, an interface between
the system controller and the processor, a random access
memory communicating with the video processor, and a video
front end, wherein the video front end communicates with
the video processor either via the system bus or via the
controller.

33. A DVD system according to claim 32, wherein the 
further processor is a PC. 

34. A digital videophone system comprising a video
processor according to claim 24 or 25.

35. A digital videophone system according to claim 33,
further comprising a system controller communicating with
the video processor, a modem interface communicating with
the system controller, a random access memory
communicating with the video processor, a video front
end, an audio front end, wherein the video front end
communicates with the video processor either via the
system bus or via the controller, an audio back end and a
video back end, wherein the audio and video back ends are
connected to the system controller.

36. A digital video camera comprising a video processor 
according to claim 24 or 25. 

37. A digital video camera according to claim 36, further
comprising a system controller communicating with the
video processor, a random access memory communicating with
the video processor, an audio front end, a video front
end, wherein the audio and video front ends communicate
with the video processor either via the system bus or via
the controller, and a DVD controller.

38. A procedure for defining the architectural parameters
of a parallel processor for estimating motion according to
any of claims 1 to 23, comprising the steps of:

calculating the number of processing elements K*S
where K*S is a matrix having K columns and S rows,
comprising determining the time period required to perform
block matching procedures on current frame macroblocks;

calculating the number of input/output ports from the
time interval required to load current frame data and
anchor frame data for processing and to output coordinate
data of calculated motion vectors;

calculating the size of memory cells required to
enable a column of inputs to be loaded in a given time;
and

calculating the number of processing elements in both
the vertical and horizontal directions by calculating the
optimum value of K based on the processor hardware
necessary to implement the processor.

PARALLEL PROCESSOR FOR MOTION ESTIMATOR

A parallel processor for estimating motion between 
macroblocks of a current frame and an anchor frame 
comprises a KxS matrix of processing elements (26) , K 
input modules (20) each for inputting anchor frame data to 
a row of processing elements, an input module (30) for 
inputting current frame data, S current frame buffers each 
for inputting current frame macroblock data to each of a 
column of processing elements, S comparator modules each 
for comparing the output of each of a column of processing 
elements, and an output module for outputting coordinates 
of anchor frame macroblocks most similar to given current 
frame macroblocks. The S current frame macroblocks may
each be compared simultaneously with nK anchor
frame macroblocks, thereby significantly reducing
processing time.






